Discriminatively Trained And-Or Graph Models
for Object Shape Detection 

Liang Lin, Xiaolong Wang, Wei Yang, and Jian-Huang Lai 


Abstract —In this paper, we investigate a novel reconfigurable part-based model, namely And-Or graph model, to recognize object 
shapes in images. Our proposed model consists of four layers: leaf-nodes at the bottom are local classifiers for detecting contour 
fragments; or-nodes above the leaf-nodes function as the switches to activate their child leaf-nodes, making the model reconfigurable 
during inference; and-nodes in a higher layer capture holistic shape deformations; one root-node on the top, which is also an or-node, 
activates one of its child and-nodes to deal with large global variations (e.g. different poses and views). We propose a novel structural 
optimization algorithm to discriminatively train the And-Or model from weakly annotated data. This algorithm iteratively determines 
the model structures (e.g. the nodes and their layouts) along with the parameter learning. On several challenging datasets, our model
performs robust shape-based object detection against background clutter and outperforms other
state-of-the-art approaches. We also release a new shape database with annotations, which includes more than 1500 challenging
shape instances, for recognition and detection.

Index Terms —Object Detection, Grammar Model, And-Or Graph, Structural Optimization. 
1 Introduction

As psychophysics experiments have suggested, humans can
successfully identify objects in images using contour
fragments alone [38]. In computer vision, recognizing
object shapes from salient contours is an active research
area. Several methods have demonstrated that contours
(silhouettes) are robust against variations of illumination,
color, and texture. However, two long-standing difficulties
remain in current research.

• Unreliable edge map extraction and contour tracing.
Some key contours can be missing or connected to
their background, making it difficult to accurately
localize shapes against surrounding clutter.

• Large variations within an object category, e.g. different
object poses, views, occlusions, and deformations.
Without appearance or texture information, this
challenge can be more serious, as shape contours are
somewhat ambiguous and less discriminative.

Some recently proposed approaches addressed these
two issues by learning hierarchical and compositional
models, and achieved substantial progress [32].
These models represent an object shape in terms of
its parts (i.e. local contours) and the inter-part relations.
However, their model structures (e.g. the number of
parts and the ways of composition) are often fixed,
which limits their performance in complex scenarios.

In this work, we develop a novel reconfigurable part-based
model in the form of an And-Or graph representation,
which is discriminatively trained from weakly
annotated training data (i.e. without annotating the object
parts). Our model achieves superior performance
on the task of detecting and localizing shapes against
cluttered backgrounds, compared with other state-of-the-art
methods. Figure 1 shows an example of our And-Or
graph model. The key component of our model is
the "switch variable", referred to as the or-node, which
incorporates the compositional alternatives and makes the
model reconfigurable. Specifically, the or-node specifies
the way of composition by activating its child nodes,
to deal with the above-mentioned challenges in shape
detection. Our And-Or graph model consists of four
layers, described as follows.

The leaf-nodes at the bottom represent a batch of local
classifiers that detect the salient contour fragments of
objects. Each leaf-node is defined within a divided block,
denoted by the red box at the bottom of Figure 1. Given
the edge map extracted from an image, a leaf-node takes
the contours falling into its block as its inputs. If
a long contour extends beyond the block, it is automatically
truncated. This is effectively a partial matching scheme
that handles unreliable bottom-up edge tracing, i.e.
it prevents object contours from being connected to the background.
Moreover, to capture the discriminability of contours,
we design a new contour feature that combines the
triangle-based descriptor [20] and the Shape Context
descriptor [3].

Fig. 1. An example of our And-Or graph model. It
comprises four layers from bottom to top: the leaf-nodes
(denoted by the solid circles) at the bottom for localizing
local contour fragments, the or-nodes (denoted by the
dashed blue circles) above them specifying the activations
of their child leaf-nodes, the and-nodes (denoted
by the solid squares) encoding the holistic (view-based)
variances, and the root-node (denoted by the dashed blue
square) on the top to switch the selection of its child
and-nodes. The horizontal links incorporate contextual
interactions among parts. Note that the leaf-nodes inherit
the links that are defined among the or-nodes.
The nodes and links in red indicate the activation of
leaf-nodes during the detection.

The or-nodes are defined as switch variables that specify
the activation of their child leaf-nodes, denoted by the
dashed blue circles in Figure 1. During detection, each
or-node activates one of its child leaf-nodes and also
selects the contour fragment detected by the activated
leaf-node. The or-nodes thus represent the parts of an
object shape, while the leaf-nodes capture all of the
local variability. As Figure 1 illustrates, our model can
capture not only the local variations (e.g. part 2 of the
example), but also the inconsistency caused by missing
or broken edges (e.g. part 3 of the example).

The collaborative edges in our model impose
contextual information among shape contours, denoted by
the horizontal links between the leaf-nodes in Figure 1.
Some existing compositional shape models ignore
the contextual relations among contours, or simplify the
relations by calculating the co-occurrence frequencies of
neighboring contours. In contrast, we utilize informative
spatial layout features to define the edges, motivated
by methods for contextualized object detection [5].

The and-nodes aggregate the local shape contours that
have been selected via the or-nodes. Each and-node is
defined as a potential function that captures the holistic
shape deformations and distortions. Once the contour
fragments are localized, the and-nodes further verify
them as a whole to improve the discriminability of our
model.

The root-node at the top functions as a switch to
choose among its child and-nodes, accounting for large
global variations (e.g. different views of shapes). It is
defined in exactly the same way as the or-nodes. For
example, two horses may appear very different under
different views, and our model can adaptively activate
different and-nodes to detect them.

From the bottom to the top, our model is hierarchically
constructed into an "And-Or-And-Or" structure. Note
that the leaf-nodes in our model can also be viewed as
and-nodes, as they are defined in the same way. This
structure is expressive and general enough to model object
variations. The "And" symbol indicates the combination
of sub-parts, while the "Or" symbol indicates the switch
between possible configurations. We introduce latent
variables to make our model reconfigurable. In particular,
the latent variables include the activation states of
the or-nodes and the root-node, and the locations of
contour fragments. The leaf-nodes and the and-nodes
are defined as classification functions whose coefficients
are treated as the observable model parameters. With the
latent variables, the graph nodes and edges are explicitly
mapped to the discriminative classification function of
our model. Figure 4 provides an intuitive illustration of
our And-Or graph model, which will be discussed later
on. We regard our model as a general extension of the
pictorial and deformable part-based models,
as it incorporates not only the hierarchical decompositions,
but also the explicit structural alternatives.

The training of the And-Or graph model is another
innovation of this work. The challenges lie in two aspects.
First, multiple parameters in different layers need
to be optimized along with the latent variables, and the
objective function for optimization is non-convex, so it
cannot be solved directly with traditional methods
such as support vector machines (SVMs). Second, it
is non-trivial to automatically discover the model structures
during learning, as the training examples are
not annotated with object parts. In the literature, learning
And-Or graph models (or other reconfigurable models)
usually relies on elaborate annotations or initializations.
To cope with these two problems,
we propose a novel learning method, called Dynamical
Structural Optimization (DSO), which is inspired by
recently proposed optimization methods.
This algorithm iteratively optimizes the model structures
together with the multi-layer parameter learning, and
includes three main steps. (i) Apply the current model to the
training examples while estimating the latent variables
for each example. (ii) Discover new model structures. As
the model structures are mapped to the discriminative
function of our model (see Figure 4), refactoring (rearranging)
the feature vectors of training examples can
lead to new structures. In brief, we perform clustering
on the sub-feature-vectors corresponding to different
nodes, and generate new structures according to the
clustering results. For example, if the sub-feature-vectors
corresponding to one part of the shape are clustered
into three groups, then we create three leaf-nodes accordingly
to detect the local contours. (iii) Learn the model
parameters with the newly generated structures.

Shape detection using the And-Or graph model is
realized by searching over an image pyramid. We first
perform two testing steps to generate several hypotheses
of detection, where each hypothesis represents a
configuration comprising detected contour fragments. (i)
Local testing uses all leaf-nodes to detect contour fragments
within the edge map. (ii) Binding testing imposes
the collaborative edges among the contour fragments
to further weigh the hypotheses. Afterwards, the and-nodes
re-score each hypothesis by measuring the contour
fragments as a whole. The root-node decides the final
detection by selecting the most probable hypothesis.

The remainder of this paper is organized as follows.
Section 2 provides a brief review of related work. Then
we present the model representations in Section 3 and
follow with a description of the inference procedure in
Section 4. Section 5 focuses on the learning
algorithm. The experimental results and comparisons are
then presented, and the final section concludes the paper.

2 Related Work 

In this section, we review the extant techniques for shape 
(or contour) matching and shape model learning. 

Many methods treat shape detection as a task of
matching contours with certain distance measures, and
they mostly utilize hand-drawn reference templates.
To handle diverse shape
deformations and distortions, a number of robust shape
(or contour) descriptors have been extensively discussed,
such as Shape Context [3], Geodesic-Intensity
Histogram [18], Contour Flexibility [39], and Local Angle
[20]. Based on these shape features, several effective
matching schemes have been proposed
to deal with the various challenges. For example, the
inner-distance matching algorithm was presented to
handle articulated shape deformations. Tu et al. [34]
presented an efficient data-driven EM algorithm to iteratively
optimize shape alignment and matching correspondences.
Felzenszwalb et al. [8] proposed to hierarchically
match shapes using a dynamic programming
algorithm, demonstrating good potential in capturing
large shape deformations. An MCMC-based sampling
algorithm was discussed in [12] to solve multi-layer
shape matching. To overcome the problems caused by
incomplete or noisy contours, Zhu et al. [47] presented
a many-to-many contour matching algorithm using a
voting scheme. Riemenschneider et al. solved
partial shape matching by identifying matches from
fragments of arbitrary length to the reference contours.

An alternative approach to shape detection is to
learn shape models for a given category of shape
instances. These methods represent shapes as a loose
collection of local contour fragments or an ensemble of
pairwise constraints [32]. They usually involve
the construction of a codebook of contour fragments
(e.g. Groups of Adjacent Contours (GAS)) and train
the shape models by supervised learning. For example,
boosting methods were employed to train
discriminative classifiers with contour-based features [28],
[24]. Maji et al. [23] incorporated the Hough transform
into a discriminative learning framework, in which the
contour words and their spatial layout were optimized
jointly. Kokkinos and Yuille [13] suggested hierarchically
parsing shapes with bottom-up and top-down computations,
and adopted the multiple instance learning
algorithm for model training. Another type of shape
template is the active basis model proposed by Wu
et al. [35], which was trained with a shared sketch
algorithm.

Very recently, major progress has been made in
appearance-based object recognition using
latent structure models [44], [7], in which the latent
variables effectively enrich the representations. These
methods owe their success to their ability to cope with
deformations, occlusions, and variations. Based on these
methods, Srinivasan et al. [32] trained a descriptive
contour-based detector using the latent-SVM algorithm,
Song et al. [31] integrated context information
with SVM-based learning, and Schnitzspan et al.
further combined latent discriminative learning with
conditional random fields using multiple types of shape
features.

The And-Or graph was originally explored by Zhu
and Mumford [46] for modeling complex visual patterns.
Its key idea, using And/Or nodes to account for
structure reconfigurations and variabilities in hierarchical
composition, has been extensively applied in several
vision tasks such as object and scene parsing [13], [37]
and event analysis [29]. However, these approaches
often require elaborate annotations or manual initializations.
Si and Zhu [30] recently presented a framework for
unsupervised learning of the And-Or image template,
and demonstrated very promising results on modeling
complex object categories. Our approach is partially motivated
by these works, and we pursue an alternative
way to discriminatively train the And-Or graph model
with non-convex optimization. Our preliminary attempts
along this path have been discussed in [36].
3 Representations

In this section, we define all the components of our And- 
Or graph models, including the shape features and the 
potential functions for graph nodes and edges. 

3.1 Contour Descriptor 

First, we introduce our contour descriptor for characterizing
local contour fragments. As Figure 2 illustrates,
this feature combines the triangle-based descriptor [20]
and the Shape Context [3], capturing local contour deformations
together with the surrounding context. For any contour
fragment, we extract a sequence of sample points $\Omega$,
and for each point in $\Omega$, its triangle-based descriptor
and Shape Context descriptor are both computed and
concatenated into a vector. Then we pool the vectors of
all the sample points into a histogram.



Fig. 2. Illustration of the proposed contour descriptor. 
This feature combines the Shape Context descriptor in (a) 
and the triangle-based descriptor in (b) to characterize a 
local contour fragment. 

Given a point $T \in \Omega$ on a contour, we collect the triangles
that are formed by $T$ and any other two points $A$, $B$ in $\Omega$. Note
that each triangle is constructed from three different points.
As Figure 2(b) illustrates, the triangle-based descriptor
for $T$ is a 3-D histogram, denoted by $\Xi^{\Delta}(T)$, whose three
dimensions are the angle $\angle BTA$ and the two distances
$TA$ and $TB$, respectively. We use the
clockwise orientation to determine the angle $\angle BTA$,
and the distances $TB$ and $TA$ are normalized by the
average distance between the points in $\Omega$. The Shape
Context descriptor, denoted by $\Xi^{S}(T)$, is constructed from
$T$ and all other points in $\Omega$.

In our implementation, the number of sample points
for each contour fragment is fixed at 20, and the
distances between adjacent points in $\Omega$ are equal. For
each point $T$, $(20 - 1) \times (20 - 2)/2 = 171$ triangles are thus
collected. We define the 3-D histogram $\Xi^{\Delta}(T)$ with
2 bins for $TA$, 2 bins for $TB$, and 6 bins for the angle
$\angle BTA$ ranging from 0 to $\pi$. We flatten $\Xi^{\Delta}(T)$ into
a $2 \times 2 \times 6 = 24$-bin 1-D feature vector. For the Shape
Context descriptor $\Xi^{S}(T)$, we use 2 bins for lengths and 6
bins for polar angles ranging from 0 to $2\pi$, so its length
is $2 \times 6 = 12$. By concatenating these two descriptors, we
obtain the feature vector of $T$ with $24 + 12 = 36$
bins. Thus, the contour fragment is represented by a
feature vector of $36 \times 20 = 720$ bins.
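The bin counts above can be checked with a small sketch. This is only an illustration under stated assumptions, not the authors' implementation: the text does not say where the two distance bins are split, so `R_SPLIT` (a split at one average inter-point distance) is hypothetical, as are the function names and the y-up convention used to order A and B clockwise.

```python
import itertools
import math

N_POINTS = 20   # sample points per contour fragment (from the text)
R_SPLIT = 1.0   # assumed boundary between the two distance bins (not from the text)

def mean_pairwise_distance(pts):
    pairs = list(itertools.combinations(pts, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

def triangle_histogram(pts, t, scale):
    """Flattened 2 x 2 x 6 histogram over (|TA|, |TB|, angle BTA) for point t."""
    hist = [0] * 24
    tx, ty = pts[t]
    others = [p for k, p in enumerate(pts) if k != t]
    for a, b in itertools.combinations(others, 2):
        va = (a[0] - tx, a[1] - ty)
        vb = (b[0] - tx, b[1] - ty)
        # Assumed convention (y-up frame): swap so that B lies clockwise of A.
        if va[0] * vb[1] - va[1] * vb[0] > 0:
            va, vb = vb, va
        na, nb = math.hypot(*va), math.hypot(*vb)
        cos = (va[0] * vb[0] + va[1] * vb[1]) / (na * nb)
        angle = math.acos(max(-1.0, min(1.0, cos)))          # in [0, pi]
        a_bin = min(int(angle / (math.pi / 6)), 5)
        da_bin = int(na / scale > R_SPLIT)
        db_bin = int(nb / scale > R_SPLIT)
        hist[(da_bin * 2 + db_bin) * 6 + a_bin] += 1
    return hist

def shape_context_histogram(pts, t, scale):
    """Flattened 2 x 6 histogram over (length, polar angle) of the other points."""
    hist = [0] * 12
    tx, ty = pts[t]
    for k, (x, y) in enumerate(pts):
        if k == t:
            continue
        r_bin = int(math.hypot(x - tx, y - ty) / scale > R_SPLIT)
        theta = math.atan2(y - ty, x - tx) % (2 * math.pi)   # in [0, 2*pi)
        a_bin = min(int(theta / (2 * math.pi / 6)), 5)
        hist[r_bin * 6 + a_bin] += 1
    return hist

def contour_descriptor(pts):
    """Concatenate the 24 + 12 bins of every sample point."""
    scale = mean_pairwise_distance(pts)
    feature = []
    for t in range(len(pts)):
        feature += triangle_histogram(pts, t, scale)
        feature += shape_context_histogram(pts, t, scale)
    return feature

# Toy contour fragment: 20 equally spaced points on a quarter circle.
toy = [(math.cos(k * math.pi / 38), math.sin(k * math.pi / 38))
       for k in range(N_POINTS)]
desc = contour_descriptor(toy)
print(len(desc))  # 36 bins per point times 20 points
```

Each sample point contributes 171 triangle votes and 19 Shape Context votes, so the histogram mass is a quick sanity check on the counting.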

3.2 And-Or Graph Model 

Our model is defined in the form of an And-Or graph
$G = (V, E)$, where $V$ includes four levels of nodes and
$E$ includes the graph edges. The root-node is indexed as
0, indicating the switch among different shape views (or,
by analogy, other global variations). The and-nodes
are indexed by $r = 1, \ldots, m$, each representing
one global classifier. For each and-node, there
are $z$ or-nodes arranged in a layout of
$b_1 \times b_2$ blocks to represent the object parts, and we
index all of the or-nodes as $j = m + 1, \ldots, (z + 1) * m$.
The leaf-nodes in the fourth layer are indexed by
$i = (z + 1) * m + 1, \ldots, (z + 1) * m + 1 + n$, where $n$ is the
number of leaf-nodes. For notational simplicity, we define
$m' = (z + 1) * m + 1$, $n' = (z + 1) * m + 1 + n$, and
let $i \in ch(j)$ indicate that node $i$ is a child of node $j$. The details
of the model $G$ are described as follows.

Leaf-node: Each leaf-node $L_i$ is a local classifier for
detecting partial shape contours. We denote the location
of leaf-node $L_i$ as $p_i$, which is determined by its parent
or-node. Given the extracted edge map $X$, we treat the
contour fragments within the observed block as the
inputs of $L_i$. For a contour $c$, we denote $\phi^{L}(p_i, c)$ as
its feature vector using the proposed contour descriptor,
where only the part of $c$ that falls into the block is
considered. Note that, in practice, very short
contours can be pruned as noise. The response of classifier
$L_i$ located at $p_i$ is defined as:

$$R^{L}_i(X, p_i) = \max_{c \in X} \omega^{L}_i \cdot \phi^{L}(p_i, c), \tag{1}$$

where $\omega^{L}_i$ is a parameter vector that is set to zero if
the corresponding leaf-node $L_i$ is nonexistent. We can
thus localize the contour representing the shape part
by $c_i = \arg\max_{c \in X} \omega^{L}_i \cdot \phi^{L}(p_i, c)$. This partial detection
scheme makes it possible to separate true object contours from
cluttered background.
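As a rough illustration of Equation (1), the sketch below truncates contours to a block and then takes the best linear score over the candidates. The helper names, the block format `(x0, y0, x1, y1)`, and the toy two-dimensional descriptors are assumptions made for the example; returning negative infinity when no contour is available is also an assumption, since the text does not cover that case.

```python
def clip_to_block(contour, block):
    """Keep only the contour points that fall inside block = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = block
    return [(x, y) for (x, y) in contour if x0 <= x <= x1 and y0 <= y <= y1]

def dot(w, f):
    return sum(wi * fi for wi, fi in zip(w, f))

def leaf_response(w, features):
    """Best score w . phi(p_i, c) over the candidate contours c, with its index."""
    best_score, best_index = float("-inf"), None
    for index, f in enumerate(features):
        score = dot(w, f)
        if score > best_score:
            best_score, best_index = score, index
    return best_score, best_index

# Toy descriptors for three contours inside one block (real ones have 720 bins).
features = [[0.25, 0.125], [1.0, 0.25], [0.5, 0.5]]
score, index = leaf_response([1.0, -1.0], features)
print(score, index)
```

The returned index plays the role of the selected contour $c_i$ for that leaf-node.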

Or-node: The or-node $U_j$, $j = m + 1, \ldots, (z + 1) * m$,
specifies one of its child leaf-nodes, and also the contour
detected by that leaf-node. Every or-node is allowed to
slightly perturb its location with respect to the root
in order to capture the inter-part deformations.

For each or-node $U_j$, we define the deformation feature
$\phi^{D}(p_0, p_j) = (dx, dy, dx^2, dy^2)$, where $(dx, dy)$ encodes
the displacement of the or-node position $p_j$ relative to the
expected position $p_0$ determined by the root-node. The
cost of locating $U_j$ at $p_j$ is:

$$D_j(p_0, p_j) = \omega^{D}_j \cdot \phi^{D}(p_0, p_j), \tag{2}$$

where $\omega^{D}_j$ is a 4-dimensional parameter vector corresponding
to $\phi^{D}(p_0, p_j)$.

For each leaf-node $L_i$ associated with $U_j$, we introduce
an indicator variable $v_i \in \{0, 1\}$ representing whether it
is activated by $U_j$ or not. We also define an auxiliary
vector for $U_j$, $v_j = \{v_i\}_{i \in ch(j)}$, where $\|v_j\| = 1$ or 0. Note
that $\|v_j\| = 1$ only when one of the leaf-nodes under
$U_j$ is activated. In this way, the or-node can adaptively
activate different leaf-nodes to capture the diverse
local shape variations. It is worth mentioning that the cost
of locating the or-node is independent of the selected
leaf-node, because we assume that the leaf-nodes belonging to
the same part (i.e. or-node) lie at nearby locations.

Thus, the response of the or-node $U_j$ is defined as

$$R^{U}_j(X, p_0, p_j, v_j) = \sum_{i \in ch(j)} R^{L}_i(X, p_j) \cdot v_i - D_j(p_0, p_j). \tag{3}$$
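A minimal sketch of Equations (2) and (3) follows, assuming the leaf-node scores have already been computed. The function names and the toy numbers are hypothetical; the only facts taken from the text are the four-dimensional deformation feature, the one-hot activation vector, and the subtraction of the deformation cost.

```python
def deformation_feature(p0, pj):
    """(dx, dy, dx^2, dy^2) for the displacement of p_j from the expected p_0."""
    dx, dy = pj[0] - p0[0], pj[1] - p0[1]
    return (dx, dy, dx * dx, dy * dy)

def or_node_response(leaf_scores, v, w_def, p0, pj):
    """Selected leaf score minus the deformation cost of placing the or-node at pj."""
    if any(vi not in (0, 1) for vi in v) or sum(v) > 1:
        raise ValueError("at most one child leaf-node may be active")
    cost = sum(w * f for w, f in zip(w_def, deformation_feature(p0, pj)))
    return sum(s * vi for s, vi in zip(leaf_scores, v)) - cost

def best_activation(leaf_scores):
    """One-hot indicator vector selecting the highest-scoring child leaf-node."""
    best = max(range(len(leaf_scores)), key=leaf_scores.__getitem__)
    return [1 if i == best else 0 for i in range(len(leaf_scores))]

leaf_scores = [0.5, 2.0, 1.0]          # toy responses of three child leaf-nodes
v = best_activation(leaf_scores)
response = or_node_response(leaf_scores, v, (0.0, 0.0, 0.5, 0.5), (10, 10), (11, 12))
print(v, response)
```

Because the deformation cost does not depend on which leaf-node is selected, choosing the highest-scoring child maximizes the response for a fixed position.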
Fig. 3. The spatial contextual features defined for the
collaborative edges.

And-node: The and-node, $A_r$, performs a global verification
for the whole shape. For each and-node, we
have a set of contour fragments, $C_r = \{c_1, c_2, \ldots, c_z\}$,
which are determined by its child or-nodes. We then
adopt the Shape Context descriptor [3] to describe these
contours as a whole, denoted by $\phi^{A}(C_r)$. Thus, we define the
and-node's response as

$$R^{A}_r(C_r) = \omega^{A}_r \cdot \phi^{A}(C_r), \tag{4}$$

where $\omega^{A}_r$ is the corresponding parameter vector.

Collaborative Edge: We impose contextual interactions
among shape parts through the collaborative
edges. Given any two different or-nodes associated with
the same and-node, we link an edge between them, and
their child leaf-nodes inherit the edge. We define the
collaborative edges using spatial contextual features,
as Figure 3 illustrates.

Suppose one edge connects two leaf-nodes $(L_i, L_{i'})$ that
are located at $p_i$ and $p_{i'}$, respectively. We collect a 4-bin
feature $\psi(p_i, p_{i'})$ for the two leaf-nodes according to
their spatial layout. Each bin of $\psi(p_i, p_{i'})$ represents one
of the four relations of $(L_i, L_{i'})$: clockwise, anti-clockwise,
near, and far. In Figure 3, the bold rectangle in the center
indicates the location of $L_i$, which is connected to the red
bold rectangle indicating the location of $L_{i'}$. The dashed
line represents the initial layout of the two leaf-nodes,
and the red solid line is the adjusted actual layout in
detection. Specifically, we define the relations as follows.

• Near and Far: If $L_{i'}$ falls into the outer dashed
rectangle, it is near to $L_i$, i.e. the bin of near is
activated (set to 1); otherwise it is far from
$L_i$.

• Clockwise and Anti-clockwise: One of the two relations
is activated (set to 1) according to the angle
between the dashed line and the solid red line.

These relations intuitively encode the spatial context
of two leaf-nodes $(L_i, L_{i'})$. Let $\{v_j\}$ represent the activation
variables of the leaf-nodes, and let $P$ denote
a vector of the positions of all or-nodes $U_j$. $P$ also specifies
the locations $\{p_i\}$ of the activated leaf-nodes. The
response of the collaborative edges is then parametrized
as

$$R^{E}(P, \{v_j\}) = \sum_{j \in ch(r)} \sum_{i \in ch(j)} \sum_{i' \in \partial(i)} \omega^{E}_{(i,i')} \cdot \psi(p_i, p_{i'}) \cdot v_i \cdot v_{i'}, \tag{5}$$

where $\partial(i)$ represents the set of neighbor leaf-nodes of
$L_i$, each of which has a different parent node from
$L_i$, and $\omega^{E}_{(i,i')}$ is the corresponding weight. $v_i$ and $v_{i'}$ are the
activation indicators for $L_i$ and $L_{i'}$, respectively, as the
edges are imposed only for the activated leaf-nodes.
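The 4-bin spatial feature and the edge response of Equation (5) can be sketched as follows. Several details are assumptions rather than facts from the text: the size of the "near" rectangle (`near_half_w`, `near_half_h`), the use of a cross-product sign in a y-up frame to decide clockwise versus anti-clockwise, treating a zero rotation as anti-clockwise, and measuring both layout lines from the current position of $L_i$.

```python
def spatial_feature(p_i, p_k_init, p_k, near_half_w, near_half_h):
    """4-bin feature (clockwise, anti-clockwise, near, far) for a pair of leaf-nodes."""
    d_init = (p_k_init[0] - p_i[0], p_k_init[1] - p_i[1])   # initial layout (dashed line)
    d_now = (p_k[0] - p_i[0], p_k[1] - p_i[1])              # actual layout (solid line)
    cross = d_init[0] * d_now[1] - d_init[1] * d_now[0]
    clockwise = 1 if cross < 0 else 0      # assumption: y-up frame, ties are anti-clockwise
    near = 1 if abs(d_now[0]) <= near_half_w and abs(d_now[1]) <= near_half_h else 0
    return (clockwise, 1 - clockwise, near, 1 - near)

def edge_response(pairs, active, pos, init_pos, weights, near_box):
    """Sum of weighted spatial features over edges whose two leaf-nodes are both active."""
    total = 0.0
    for i, k in pairs:
        if active[i] and active[k]:
            psi = spatial_feature(pos[i], init_pos[k], pos[k], *near_box)
            total += sum(w * b for w, b in zip(weights[(i, k)], psi))
    return total

# Toy layout with three leaf-nodes; leaf-node 3 is not activated.
pairs = [(1, 2), (1, 3)]
active = {1: 1, 2: 1, 3: 0}
pos = {1: (0, 0), 2: (10, -2), 3: (0, 30)}
init_pos = {1: (0, 0), 2: (10, 0), 3: (0, 20)}
weights = {(1, 2): (0.5, 0.0, 1.0, -1.0), (1, 3): (9.0, 9.0, 9.0, 9.0)}
print(edge_response(pairs, active, pos, init_pos, weights, (15, 15)))
```

Only the edge between the two activated leaf-nodes contributes, mirroring the product $v_i \cdot v_{i'}$ in Equation (5).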

Root-node: The root-node on the top
activates exactly one of its child and-nodes, and its definition is
similar to that of the or-node. We use a variable
$v_r \in \{0, 1\}$ to specify the activation of each and-node $A_r$,
and the indicator vector for the root-node is $v_0 = \{v_r\}_{r=1}^{m}$
with $\|v_0\| = 1$, i.e. only one child is selected.

Let $P$ denote the part-based deformation of the or-nodes,
and let $V = (v_0, \{v_j\})$ denote the selection of and-nodes
and leaf-nodes. The overall response of our model
is then defined as:

$$R^{G}(X, P, V) = \sum_{r=1}^{m} v_r \cdot \Big( \sum_{j \in ch(r)} R^{U}_j(X, p_0, p_j, v_j) + R^{E}(P, \{v_j\}) + R^{A}_r(C_r) \Big). \tag{6}$$

In this model, $H = (P, V)$ are the latent variables
that will be adaptively estimated in testing. For notational
simplicity, our model in Equation (6) can be re-written
as:

$$R^{G}(X, H) = \omega \cdot \phi(X, H), \tag{7}$$

where $\phi(X, H)$ represents the concatenated feature vector
for all nodes and edges in the model, and $\omega$ includes
all of the parameters corresponding to $\phi(X, H)$. Figure 4
illustrates our And-Or graph model mapped to the
discriminative function.
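The linear form in Equation (7) can be pictured as concatenating one feature block per node or edge and zeroing the blocks of inactive nodes. The sketch below is an assumption-laden toy: the block names, the alphabetical ordering of blocks, and the numbers are hypothetical and only illustrate how the latent selection changes which bins are non-zero.

```python
def assemble_feature(blocks, active):
    """Concatenate per-node feature blocks, zeroing the blocks of inactive nodes."""
    phi = []
    for name in sorted(blocks):            # fixed (alphabetical) block order
        vec = blocks[name]
        phi.extend(vec if name in active else [0.0] * len(vec))
    return phi

def model_response(omega, phi):
    """Linear response omega . phi(X, H)."""
    return sum(w * f for w, f in zip(omega, phi))

blocks = {"and_1": [0.5], "leaf_1": [1.0, 2.0], "leaf_2": [3.0, 4.0]}
phi = assemble_feature(blocks, active={"and_1", "leaf_2"})
print(phi)
print(model_response([2.0, 1.0, 1.0, 1.0, 1.0], phi))
```

Changing the active set changes the layout of non-zero bins while the parameter vector stays fixed, which is what makes the model reconfigurable under a single set of weights.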

We summarize the symbols used in our model in Table 1.
Fig. 4. Mapping the latent And-Or graph to the discriminative function defined in Equation (7). Different layers of
nodes in our model are associated with certain bins in the feature vector $\phi(X, H)$ (at the bottom). The activated
leaf-nodes are highlighted in red, and the feature bins are set to zero for the inactivated nodes. The embedded
latent variables $H = (P, V)$ make our model reconfigurable during detection.


Symbol                          Meaning

$\{A_r\}_{r=1}^{m}$             The and-nodes.
$\{U_j\}_{j=m+1}^{(z+1)m}$      The or-nodes.
$\{L_i\}_{i=m'}^{n'}$           The leaf-nodes.
$X$                             The edge map of an image.
$P = \{p_0, p_j, p_i\}$         The locations of the root-node $p_0$, or-nodes $p_j$, and leaf-nodes $p_i$.
$R^{L}_i(X, p_i)$               The response of the classifier associated with leaf-node $L_i$ located at $p_i$.
$R^{U}_j(X, p_0, p_j, v_j)$     The response of the or-node. $v_j$ indicates the selection of its child leaf-nodes.
$R^{A}_r(C_r)$                  The response of the and-nodes, which provides a global verification for the shape $C_r$.
$R^{E}(P, \{v_j\})$             The response of the collaborative edges. $\{v_j\}$ indicates the selection of the leaf-nodes.
$R^{G}(X, P, V)$                The response of the whole model, where $P$ and $V$ represent the latent variables.
$H = (P, V)$                    All latent variables (including positions $P$ and activation variables $V$) of our model.

TABLE 1
Notation summary of this work.



Fig. 5. Illustration of the inference procedure. (a) shows local testing for detecting contour fragments within the edge
map; the blue dashed boxes represent perturbed blocks associated with the leaf-nodes. (b) shows a hypothesis of
detection including candidates (indicated by the red boxes) proposed by all or-nodes, in which the collaborative edges
are imposed. (c) shows the global verification, in which the ensemble of contours is measured as a whole.


4 Inference 

Given the edge map $X$ extracted from the image, the
inference task is to detect the optimal contour fragments
within the detection window scanned over an image
pyramid. The detection is a search procedure that activates
nodes from bottom to top, in which a number of
hypotheses are generated and each one specifies a configuration
of detected contour fragments. We verify the
hypotheses and prune the unlikely ones by maximizing
the model response defined in Equation (6).

We conduct the inference algorithm with the following
steps. An example illustrating the inference procedure
using our model is presented in Figure 5.

Local testing: We use all of the leaf-nodes (i.e. the local
contour classifiers) to search for optimal contour fragments
within the edge map $X$. Assume that one or-node
$U_j$, associated with a partitioned block in the detection
window, contains a number of leaf-nodes $\{L_i, i \in ch(j)\}$,
and that the initial position of $U_j$ is $p'$. Each $U_j$ is allowed
to slightly perturb its location. At each location $p'$, we
treat all of the contours that fall into the block
as the inputs to every leaf-node of $U_j$, as Figure 5(a)
illustrates. By maximizing the response in Equation (1),
each leaf-node $L_i$, $i \in ch(j)$, can find an optimal contour at
a certain location. Recall that each or-node can activate
only one of its child leaf-nodes. Thus, the different possible
leaf-node selections generate a batch of
detection hypotheses. In particular, we denote $H$ as the
latent variables for one hypothesis, and denote $(v_j, p_j)$
as a possible activation of $U_j$, where $v_j$ indicates the
leaf-node selection and $p_j$ is the location. The cost of
$(v_j, p_j)$ is then measured by the function $R^{U}_j$ defined in
Equation (3).

Binding testing: The hypotheses from the local testing
are further weighed and filtered by imposing the
collaborative edges. In each hypothesis, each or-node
proposes one leaf-node, and any two leaf-nodes derived
from different or-nodes are connected by an edge. We
measure the score by the potential function in Equation (5).

In this way, each detection hypothesis is scored by the
two testing steps as

$$S_1(X, H) = \sum_{j \in ch(r)} R^{U}_j(X, p_0, p_j, v_j) + R^{E}(P, \{v_j\}), \tag{8}$$

where $P = \{p_j\}$ denotes the locations of all of the or-nodes.
In practice, we can prune some of the hypotheses
by setting a threshold on the score.

Global verification: In this step, we apply the and-nodes
to re-score the hypotheses of detection. For any
hypothesis, we obtain an ensemble of contours,
$C_r = \{c_1, c_2, \ldots, c_z\}$, each of which is proposed by one
or-node. We can measure the contours as a whole by
$S_2(X, H) = R^{A}_r(C_r)$ in Equation (4), as Figure 5(c)
illustrates.

Afterwards, the root-node determines the optimal detection
by selecting the maximum aggregated score, as

$$H^{*} = \arg\max_{H} \sum_{r=1}^{m} \big( S_1(X, H) + S_2(X, H) \big) \cdot v_r, \tag{9}$$

where $\|v_0\| = 1$ constrains the root-node to select only one
of the and-nodes.
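The three inference steps can be summarized in a short sketch. It is deliberately naive: it enumerates every combination of or-node candidates, whereas a practical detector would prune much earlier. The data layout (a dictionary from and-node index to per-or-node candidate lists) and the `edge_score` and `global_score` callables are hypothetical stand-ins for the binding and global verification scores.

```python
import itertools

def infer(and_nodes, edge_score, global_score, prune_threshold=float("-inf")):
    """Return (best score, and-node index, selected candidates).

    and_nodes maps an and-node index r to a list with one entry per or-node;
    each entry is a list of (or_node_response, payload) candidates.
    """
    best = (float("-inf"), None, None)
    for r, or_candidates in and_nodes.items():
        # Local testing output: one candidate per or-node forms a hypothesis.
        for combo in itertools.product(*or_candidates):
            payloads = [payload for _, payload in combo]
            # Binding testing: or-node responses plus the collaborative edges.
            binding = sum(score for score, _ in combo) + edge_score(r, payloads)
            if binding < prune_threshold:
                continue
            # Global verification by the and-node, then selection by the root.
            total = binding + global_score(r, payloads)
            if total > best[0]:
                best = (total, r, payloads)
    return best

and_nodes = {
    1: [[(1.0, "a"), (2.0, "b")], [(0.5, "c")]],
    2: [[(3.0, "d")], [(0.25, "e")]],
}
edge = lambda r, p: 0.5 if p == ["b", "c"] else 0.0
verify = lambda r, p: 1.0 if r == 1 else 0.0
print(infer(and_nodes, edge, verify))
```

Raising `prune_threshold` shows how early pruning on the binding score can change the final detection, which is why the threshold is described as a practical choice.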

The overall inference procedure appears in Algorithm 1.

Algorithm 1: Inference with the And-Or graph representation

Input:

X: the edge map extracted from the test image.

Output:

H*: the optimal detection with the maximal detection score R^G(X, H*).

Local testing:

1. Apply leaf-nodes to detect all possible local contour fragments.

2. Generate a batch of detection hypotheses via the or-nodes.

Binding testing:

1. Impose the collaborative edges between leaf-nodes in each
detection hypothesis.

2. Score the hypotheses by Equation (8).

3. Prune unlikely hypotheses by thresholding the score.

Global verification:

1. For each hypothesis, the local contours are measured as a whole
via the and-nodes.

2. Aggregate all potentials via the root-node in Equation (9).

3. Merge results by non-maximum suppression over all image
positions and scales.

5 And-Or Graph Learning

We formulate the And-Or graph model learning as a
joint optimization task of model structures and parameters.
To achieve this goal, we present a novel structure
learning algorithm extended from the existing non-convex
optimization methods [43], [36]. This algorithm
optimizes the objective in a dynamical manner: the latent
structures $H = (P, V)$ are iteratively determined along
with the parameter learning in each step. For example,
leaf-nodes are created or removed to better adapt
to the training data by adjusting the latent variables.
One instance of our learning procedure is illustrated in
Figure 6: from (a) to (b), a leaf-node associated with one
or-node is removed, and a new leaf-node under the other
or-node is created in (c).

5.1 Optimization Formulation 

Suppose we have a set of positive and negative training
samples $(X_1, y_1), \ldots, (X_N, y_N)$, where $X$ is the edge map
and $y = \pm 1$ is the label indicating positive and negative
samples. We assume that the samples indexed from 1 to
$K$ are the positive samples, and we write the feature vector
for each sample $(X, y)$ as $\phi(X, y, H)$,
where $H$ represents the latent variables and $\phi(X, H)$ is the
overall feature vector of the And-Or graph model. Then
we pose the And-Or graph learning as optimizing the model
parameters along with the latent structures,

UJ = argmaXy^ni^ * ^))- (H) 

We further transfer this target into a maximum margin 
formulation, 

1 ^ 

min - ||w||^ + A y'[max(w • y, H) + C{yk, y, H)) 

-m&x{uj-(p{Xk,yk,H))], (12) 

11 

where A is a penalty weight (set as 0.005 empirically), 
and C{yk^y,H) is the loss function. In our implemen¬ 
tation, we define that C{yk^y^H) = 0 H yj. = y^ and 1 
otherwise. 
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Fig. 6. Illustration of the structure reconfiguration. Parts of the model, two or-nodes {Ui,Uq), are visualized in three 
intermediate steps, (a) The initial structure, i.e. the regular layout of an object. Two new structures are dynamically 
generated during the iterations, (b) A leaf-node associated with Ui is removed, (c) A new leaf-node is created and 
assigned to U^. 
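To make the maximum margin objective of Equation (12) concrete, the following sketch evaluates it by brute force for a finite set of latent configurations. It assumes a 0/1 loss on the label only, labels in {+1, -1}, and a user-supplied `feature_fn` that returns a zero vector for negative labels; the function names and the enumeration of `latent_space` are hypothetical and only practical for toy problems.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def zero_one_loss(y_true, y):
    """Loss that is 0 when the labels agree and 1 otherwise."""
    return 0.0 if y == y_true else 1.0


def latent_margin_objective(w, samples, feature_fn, latent_space, lam=0.005):
    """Evaluate 0.5*||w||^2 + lam * sum_k [loss-augmented max - explained max].

    samples: list of (X, y) with y in {+1, -1}.
    feature_fn(X, y, H): feature vector; assumed to be all zeros when y == -1.
    latent_space: finite iterable of candidate latent configurations H.
    """
    latent_space = list(latent_space)
    total = 0.0
    for X, y_k in samples:
        # Loss-augmented prediction over all labels and latent configurations.
        augmented = max(
            dot(w, feature_fn(X, y, H)) + zero_one_loss(y_k, y)
            for y in (+1, -1)
            for H in latent_space
        )
        # Best score the model can give to the ground-truth label.
        explained = max(dot(w, feature_fn(X, y_k, H)) for H in latent_space)
        total += augmented - explained
    return 0.5 * dot(w, w) + lam * total
```

Each bracketed term is non-negative, since the loss-augmented maximum always includes the ground-truth label with zero loss.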


The target energy in Equation (12) is non-convex, making it difficult to solve analytically. In this work, we propose the Dynamical Structural Optimization (DSO) method to iteratively optimize this objective based on the Concave-Convex Procedure (CCCP) [43].

5.2 Dynamical Structural Optimization 

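Following the CCCP method [43], we convert the objective function in Equation (12) into a convex and concave form as,

$$\min_{\omega} \Big[ \frac{1}{2}\|\omega\|^{2} + \lambda \sum_{k=1}^{N} \max_{y,H} \big( \omega \cdot \phi(X_k, y, H) + \mathcal{L}(y_k, y, H) \big) \Big] - \Big[ \lambda \sum_{k=1}^{N} \max_{H} \big( \omega \cdot \phi(X_k, y_k, H) \big) \Big] \tag{13}$$

$$= \min_{\omega} \big[ f(\omega) - g(\omega) \big], \tag{14}$$

where $f(\omega)$ represents the first two terms, and $g(\omega)$ represents the last term in (13). Assume $\omega_t$ is the solution for the $t$-th iteration. The solution $\omega_{t+1}$ for the next iteration can be solved by subjecting it to

$$\nabla f(\omega_{t+1}) = \nabla g(\omega_t). \tag{15}$$

A geometric explanation of CCCP is presented in Figure 7, where $\nabla g(\omega_t)$ can be regarded as a hyperplane (the red line) at $\omega_t$ (the black spot) to upper bound $-g(\omega)$. $\nabla g(\omega_t)$ can be solved analytically once $H$ is fixed. Then, $\omega_{t+1}$ can be estimated accordingly by minimizing $f(\omega)$ together with this linear upper bound. Please refer to [43] for the theoretical background.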
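The CCCP update rule can be checked on a one-dimensional toy energy. The functions below are not the paper's $f$ and $g$; they are an assumed example, $f(w) = w^4$ and $g(w) = w^2$ (both convex), chosen because the condition "gradient of f at the new point equals gradient of g at the old point" has a closed-form solution.

```python
def cccp_minimize(w0, steps=50):
    """Minimize E(w) = f(w) - g(w) with f(w) = w**4 and g(w) = w**2 by CCCP.

    Each iteration linearizes g at the current point and solves
    grad f(w_next) = grad g(w_current), i.e. 4 * w_next**3 = 2 * w_current.
    Returns the final point and the energy after every iteration.
    """
    def energy(w):
        return w ** 4 - w ** 2

    w = w0
    trace = [energy(w)]
    for _ in range(steps):
        slope = 2.0 * w  # gradient of g at the current iterate
        # Real cube root of slope / 4, handling negative slopes explicitly.
        magnitude = (abs(slope) / 4.0) ** (1.0 / 3.0)
        w = magnitude if slope >= 0 else -magnitude
        trace.append(energy(w))
    return w, trace
```

Starting from a positive `w0`, the iterates approach $1/\sqrt{2}$, a minimizer of $w^4 - w^2$, and the recorded energies never increase, which is the monotonicity property the learning procedure relies on.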

During the training procedure, the model parameters and latent structures are iteratively updated. To discover the model structures, we add one step called model reconfiguration in each iteration. Recall that the model structures (e.g. graph nodes) are mapped to the feature vectors, as Figure illustrates. In this step, from the feature vectors of all positive training examples, we first extract the sub-vectors corresponding to the different nodes (i.e. and-nodes or leaf-nodes), and for each node we perform clustering on these sub-vectors. Then, according to the clustering results, we rearrange each feature vector by placing the sub-vectors back into the feature vectors (e.g. re-assigning contour fragments to leaf-nodes). Consequently, the new model structures can be generated. Our DSO method iterates the following three steps: (i) estimate the latent variables of training samples; (ii) reconfigure the model structures; (iii) update model parameters for the new structures.



Fig. 7. Geometric illustration of the CCCP procedure. The target energy is decomposed into two functions, $f(\omega)$ and $g(\omega)$. At each step of iteration, a hyperplane (represented by the red line) is calculated as the upper bound at $\omega_t$ for optimizing $\omega_{t+1}$.

(I) The model parameters $\omega_t$ in the previous iteration are fixed. We find a hyperplane $q_t$ to upper bound $-g(\omega)$ in Equation (14),

$$-g(\omega) \le -g(\omega_t) + (\omega - \omega_t) \cdot q_t, \quad \forall \omega. \tag{16}$$

The optimal latent variables are specified for each positive training example by,

$$H_k^{*} = \arg\max_{H} \big( \omega_t \cdot \phi(X_k, y_k, H) \big). \tag{17}$$

Note that we only take the positive training examples into account, as $\phi(X_k, y_k, H) = 0$ when $y_k = -1$. That is, we apply the current model to perform detections on the training samples, and the hyperplane is constructed as

$$q_t = -\lambda \sum_{k=1}^{N} \phi(X_k, y_k, H_k^{*}). \tag{18}$$

(II) In the second step, we optimize the model structures based on the estimated latent variables $H^{*}$. All graph nodes in our model are mapped to several feature bins (i.e. sub-vectors) of $\phi(X_k, y_k, H_k^{*})$ for all of the training samples, as Figure illustrates. Hence, we achieve the model reconfiguration by rearranging $\phi(X_k, y_k, H_k^{*})$. For example, we can remove leaf-node $L_j$ by setting the corresponding bins for $L_j$ to zeros. Specifically, two sub-steps are sequentially performed to generate and-nodes and leaf-nodes, respectively.

(i) Global structure reconfiguration. In the layer of and-nodes, we perform clustering on the feature vectors corresponding to the and-nodes, i.e. the global shape features defined in Equation. Note that each vector is a part of $\phi(X_k, y_k, H_k^{*})$. The training object shapes detected by the same and-node are initially grouped into one cluster. We then perform clustering on all of the feature vectors by using ISODATA with Euclidean distance. Based on the clustering result, we rearrange the feature vectors mapped to the and-nodes. For example, if one vector is grouped into a new cluster, we move it into the bins corresponding to the and-node $A_r$ of that cluster and set its original bins to zeros. In our implementation, we fix the number of and-nodes as $m$ to simplify the computation.

(ii) Local structure reconfiguration. After the global structure reconfiguration, each and-node is associated with a group of training examples. Suppose the and-node $A_r$ includes a number of or-nodes, and every or-node $U_j$, $j \in ch(r)$, further derives its child leaf-nodes $L_i$, $i \in ch(j)$. In this step, we configure the part-level structures rooted at $U_j$. Note that this step processes each or-node and its leaf-nodes separately.

Each or-node $U_j$ specifies one part of the whole object shape. Given the training examples associated with $A_r$, we extract the local contour features from $\phi(X_k, y_k, H_k^{*})$ that correspond to the shape part of $U_j$. Then we perform clustering on these vectors and rearrange them in $\phi(X_k, y_k, H_k^{*})$, similarly to the operation on the and-nodes. In our implementation, the number of leaf-nodes is not fixed, as the local variances of shapes are usually unpredictable. Thus, there are two specific operators to generate the leaf-nodes according to the clustering.

• One new leaf-node is created if an extra cluster is generated.

• One leaf-node is removed if there are very few samples in the corresponding cluster.
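The two operators above can be sketched as a small bookkeeping function over cluster labels. The threshold `min_samples` stands in for "very few samples" and is an assumed parameter, as are the function name and the convention that a cluster id doubles as a leaf-node id; the clustering itself (ISODATA in the text) is taken as given.

```python
from collections import Counter


def update_leaf_nodes(current_leaves, cluster_labels, min_samples=2):
    """Decide which leaf-nodes exist after one clustering pass.

    current_leaves: ids of the leaf-nodes currently under one or-node.
    cluster_labels: cluster id assigned to each training sub-vector.
    Returns (leaves_after, created, removed) as sorted lists of ids.
    """
    counts = Counter(cluster_labels)
    # A cluster supports a leaf-node only if it holds enough samples.
    surviving = {c for c, n in counts.items() if n >= min_samples}
    created = sorted(surviving - set(current_leaves))  # extra clusters
    removed = sorted(set(current_leaves) - surviving)  # starved leaf-nodes
    return sorted(surviving), created, removed
```

For instance, with leaf-nodes 0, 1, 2 and labels in which cluster 2 keeps a single sample while a new cluster 3 gathers three, the sketch removes leaf-node 2 and creates leaf-node 3.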

We present a toy example in Figure 8 to illustrate the structure reconfiguration. For the sample $X_3$, a part of its feature vector $\langle \phi_5, \ldots, \phi_8 \rangle$ is grouped from one cluster into another, while the values of the feature bins are moved from $\langle \phi_5, \ldots, \phi_8 \rangle$ to $\langle \phi_1, \ldots, \phi_4 \rangle$.
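The bin movement in the toy example amounts to copying a sub-vector into the bins of its new cluster and zeroing the bins it came from. A minimal sketch, assuming feature vectors are plain Python lists and using 0-based indices (so bins 4 to 7 play the role of $\phi_5, \ldots, \phi_8$); the function name is hypothetical.

```python
def move_subvector(feature, src_bins, dst_bins):
    """Return a copy of `feature` with the values at src_bins moved to dst_bins.

    The source bins are set to zero, which is how a node stops being used by
    this sample; the destination bins receive the moved values.
    """
    if len(src_bins) != len(dst_bins):
        raise ValueError("source and destination must have the same length")
    out = list(feature)
    values = [feature[i] for i in src_bins]
    for i in src_bins:
        out[i] = 0.0
    for i, value in zip(dst_bins, values):
        out[i] = value
    return out
```

Calling the function with identical source and destination bins leaves the vector unchanged, and calling it on a sample whose source bins are already zero simply clears the destination, so the caller is expected to move only the sub-vectors that the clustering actually re-assigned.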

After the reconfiguration, the latent variables for each training example can be re-calculated and denoted by $H_k^{d}$, in accordance with the rearranged feature vectors (refer to Equation (17)). We denote the feature vectors for all examples by $\phi^{d}(X_k, y_k, H_k^{d})$. Then, the hyperplane is transformed accordingly, $q_t^{d} = -\lambda \sum_{k=1}^{N} \phi^{d}(X_k, y_k, H_k^{d})$.


Fig. 8. A toy example for structure reconfiguration. We consider 4 samples, $X_1, \ldots, X_4$, for training the structure of $U_i$ (or $A_r$). (a) shows the feature vectors $\phi$ of the samples associated with $U_i$ (or $A_r$), and the intensity of the feature bin indicates the feature value. (b) illustrates the clustering performed with $\phi$. The vector $\langle \phi_5, \ldots, \phi_8 \rangle$ of $X_2$ is grouped from cluster 2 to cluster 1. (c) shows the adjusted feature vectors according to the clustering. Note that the model structure reconfiguration is realized by the rearrangement of feature vectors, as we discuss in the text. This figure should be viewed in electronic form.


(III) The newly generated model structures can be represented by the feature vectors $\phi^{d}(X_k, y_k, H_k^{d})$, and the model parameters can then be learned by solving Equation (14),

$$\omega_t^{d} = \arg\min_{\omega} \big[ f(\omega) - g(\omega) \big]. \tag{19}$$

By substituting $-g(\omega)$ with the upper bound hyperplane $q_t^{d}$, this optimization task can be transferred as,

$$\min_{\omega} \; \frac{1}{2}\|\omega\|^{2} + \lambda \sum_{k=1}^{N} \Big[ \max_{y,H} \big( \omega \cdot \phi(X_k, y, H) + \mathcal{L}(y_k, y, H) \big) - \omega \cdot \phi^{d}(X_k, y_k, H_k^{d}) \Big]. \tag{20}$$

We solve it as a standard structural SVM problem, as,

$$\omega^{*} = \lambda \sum_{k,y,H} \alpha^{*}_{k,y,H} \, \Delta\phi(X_k, y, H), \tag{21}$$

where $\Delta\phi(X_k, y, H) = \phi^{d}(X_k, y_k, H_k^{d}) - \phi(X_k, y, H)$. We calculate $\omega^{*}$ by maximizing the dual form in standard SVM, and we apply the cutting plane method ITU to solve it.

With the estimated parameters $\omega_t^{d}$, the energy $E(\omega_t^{d})$ can be calculated for the new model, and we then compare it with the previous energy $E(\omega_t)$ to verify the new model structures. If $E(\omega_t^{d}) < E(\omega_t)$, we accept the new model structures and set $\omega_{t+1} \leftarrow \omega_t^{d}$. Otherwise, we keep the model structures as in the previous iteration and optimize the model parameters without the structure reconfiguration, i.e. by using $q_t$ instead: $\omega_{t+1} = \arg\min_{\omega} [ f(\omega) + \omega \cdot q_t ]$.

In this way, we ensure that the optimization objective in Equation (12) continues to decrease over the iterations. Thus, the algorithm keeps iterating until the objective converges.
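The acceptance test described above can be written as a small control-flow sketch. The two callables are hypothetical placeholders for "fit parameters on the reconfigured structure" and "fit parameters on the previous structure"; each is assumed to return a `(params, energy)` pair, and nothing here reflects how the authors organize their code.

```python
def accept_or_fallback(prev_energy, fit_new_structure, fit_old_structure):
    """One parameter update with structure verification.

    fit_new_structure(): returns (params, energy) for the reconfigured model.
    fit_old_structure(): returns (params, energy) for the unchanged structure;
        it is only called when the new structure is rejected.
    Returns (params, energy, accepted).
    """
    new_params, new_energy = fit_new_structure()
    if new_energy < prev_energy:
        # The reconfigured structure lowers the objective: keep it.
        return new_params, new_energy, True
    # Otherwise discard the proposal and take a plain step on the old structure.
    old_params, old_energy = fit_old_structure()
    return old_params, old_energy, False
```

Evaluating the fallback lazily keeps the cost of an accepted iteration to a single parameter fit.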

5.3 Initialization 

At the beginning of model training, our model is initialized as follows. For each training example whose contours have been extracted, we partition it into a regular layout of blocks, and each block corresponds to one or-node. The contours that fall into a block are treated as the inputs, and we initially select the longest one if more than one contour lies within the block. Then, the leaf-nodes are initially generated by clustering the selected contours without any constraints. The and-nodes are initialized in a similar way. We thus obtain the initial feature vectors for all training examples.
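The block-wise selection of initial contours can be sketched as follows. Two details are assumptions not fixed by the text: a contour is assigned to the block that contains its centroid, and its length is measured as the polyline length between consecutive points. The function names are hypothetical.

```python
import math


def polyline_length(points):
    """Sum of Euclidean distances between consecutive contour points."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))


def initial_part_assignment(contours, box, rows, cols):
    """Pick the longest contour in each block of a regular rows x cols layout.

    contours: list of contours, each a list of (x, y) points.
    box: object bounding box as (x_min, y_min, x_max, y_max).
    Returns {(row, col): contour index} for the non-empty blocks.
    """
    x_min, y_min, x_max, y_max = box
    cell_w = (x_max - x_min) / cols
    cell_h = (y_max - y_min) / rows
    selected = {}
    for idx, pts in enumerate(contours):
        cx = sum(x for x, _ in pts) / len(pts)
        cy = sum(y for _, y in pts) / len(pts)
        if not (x_min <= cx <= x_max and y_min <= cy <= y_max):
            continue  # contour lies outside the object box
        col = min(int((cx - x_min) / cell_w), cols - 1)
        row = min(int((cy - y_min) / cell_h), rows - 1)
        best = selected.get((row, col))
        if best is None or polyline_length(pts) > polyline_length(contours[best]):
            selected[(row, col)] = idx
    return selected
```

The selected contours, one per block, would then be the inputs to the unconstrained clustering that produces the initial leaf-nodes.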

Algorithm 2 summarizes the overall procedure for learning the latent And-Or graph.

6 Experiments 

To validate the advantage of our model, we present a new shape database, SYSU-Shape¹, which includes elaborately annotated shape contours. Compared with the existing shape databases, this database includes more realistic challenges in shape detection and localization, e.g. cluttered background, large intraclass variation, and different poses/views; part of its instances were originally used for appearance-based object detection. We also validate our model on two other public databases, UIUC-People [33] and INRIA-Horse [12], and show superior performance over other state-of-the-art methods.

1. http://vision.sysu.edu.cn/projects/discriminative-aog/

Algorithm 2: Learning latent And-Or graph model

Input:
    Positive and negative training samples, $\{X_k, y_k\}$ and $\{X_{k'}, y_{k'}\}$, $k = 1..K$, $k' = K+1..N$.

Output:
    The trained And-Or graph model.

Initialization:
    1. Initialize the model structure (the arrangement of nodes).
    2. Initialize the latent variables $H$ and model parameters $\omega$.

repeat
    1. Estimate the latent variables $H_k^{*}$ on each positive example $(X_k, y_k)$ with the current model parameters $\omega_t$.
    2. Generate the new graph structures.
        (a) Localize the contour fragments for all examples using the current latent variables and obtain the feature vectors $\phi(X_k, y_k, H_k^{*})$.
        (b) In the layer of and-nodes, perform clustering on the global shape features, and rearrange the feature vectors.
        (c) For each or-node $U_i$, perform clustering on the feature vectors of all its child leaf-nodes.
        (d) Operate on the leaf-nodes to generate a new structure, and the latent variable is updated to $H_k^{d}$ with the rearranged feature vectors $\phi^{d}(X_k, y_k, H_k^{d})$.
    3. Update the model parameters $\omega_{t+1}$.
        (a) Estimate the parameters $\omega_t^{d}$ with the newly generated structures.
        (b) IF $E(\omega_t^{d}) < E(\omega_t)$,
                Accept the new model structures, and $\omega_{t+1} \leftarrow \omega_t^{d}$.
            ELSE
                Calculate $\omega_{t+1}$ while keeping the structures in the previous iteration.
until the target function defined in Equation (11) converges.

Database        and-nodes   or-nodes   leaf-nodes
SYSU-Shapes     m = 3       z = 6      n ≤ 4
UIUC-People     m = 2       z = 8      n ≤ 4
INRIA-Horses    m = 1       z = 6      n ≤ 4

TABLE 2
Numbers of nodes in the And-Or graph models for different databases.

Implementation setting. We extract clutter-free object contours for the positive samples, and the edge maps for the negative samples are extracted using the Pb edge detector [23] with an edge link algorithm. For each contour as the input of the leaf-node, we sample 20 points and compute the contour descriptor for each point. During detection, the edge maps of test images are extracted as for the negative training samples. The objects are searched by sliding windows over 6 different scales, 2 per octave, and detections are reported by non-maximum suppression. We adopt the testing criterion defined in the PASCAL VOC challenge: a detection is counted as correct if its overlap with the groundtruth bounding-box is greater than 50%.
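The PASCAL VOC correctness test can be expressed in a few lines, where overlap is the area of intersection divided by the area of union of the two boxes. The sketch assumes boxes given as `(x_min, y_min, x_max, y_max)` in continuous coordinates; the function names are illustrative, and the official development kit uses inclusive pixel coordinates, which changes areas slightly.

```python
def overlap_ratio(det, gt):
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = min(det[2], gt[2]) - max(det[0], gt[0])
    inter_h = min(det[3], gt[3]) - max(det[1], gt[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0  # boxes do not overlap
    inter = inter_w * inter_h
    area_det = (det[2] - det[0]) * (det[3] - det[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / (area_det + area_gt - inter)


def is_correct_detection(det, gt, threshold=0.5):
    """A detection counts as correct when its overlap exceeds the threshold."""
    return overlap_ratio(det, gt) > threshold
```

Two equally sized boxes shifted by half their width overlap by only one third under this measure, so the 50% criterion is stricter than it may first appear.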

Our model is able to flexibly adapt to the data by setting the numbers of nodes in each layer: m for and-nodes, z for or-nodes, and n for leaf-nodes. Recall that each or-node in our model indicates a part of object shape, so that we can set the number of or-nodes according to the sizes (scales) of the shape categories. The leaf-nodes are produced during the iterative training, and their numbers can be determined automatically. In the experiments, to reduce computational cost, we fix the number of and-nodes and set an upper limit for the number of leaf-nodes. Table 2 summarizes the numbers of nodes on the three databases. In the model training, the initial layout for each sample is a regular partition (e.g. 2x4 blocks for the UIUC-People dataset and 3x2 for the other two datasets).

Fig. 9. Precision-Recall (PR) curves on the SYSU-Shape dataset.

Method           Airplane  Bicycle  Boat   Car    Motorbike  MeanAP
AOG (full)       0.520     0.623    0.419  0.549  0.583      0.539
AOG (3-layers)   0.348     0.482    0.288  0.466  0.333      0.383
DPMs             0.437     0.488    0.365  0.509  0.455      0.451

TABLE 3
Detection accuracies on the SYSU-Shape dataset.

If we keep only one and-node (i.e. m = 1), our model is simplified into a 3-layer structure that is rooted at the and-node. The training procedure (i.e. Algorithm 2) for this structure is kept, but we discard the step of generating and-nodes.

We conduct the experiments on a workstation with a Core Duo 3.0 GHz CPU and 16 GB memory. On average, it takes 4 to 8 hours to train a shape model, depending on the number of training examples, and the time cost for detection on an image is around 1 to 2 minutes.

Experiment I. We first conduct the experiment on the SYSU-Shape database, which is collected from the Internet and other vision databases. There are 5 categories, i.e. airplanes, boats, cars, motorbikes, and bicycles, and each category contains 200 to 500 images. The shape contours are carefully labeled by a professional team using the LabelMe toolkit [26]. It is worth mentioning that each image contains at least one object of a given category. For each category, half of the images are randomly selected as positive samples for training and the rest are used for testing. The images from the other categories are randomly split into two halves as negative samples for training and testing.


For comparison, we apply the well-acknowledged deformable part-based models (DPMs) IZ! on this database, where we modify the released code by replacing the input feature with our shape descriptor, and keep the other settings. In this implementation, 3 DPMs are merged into a mixture, which accounts for different object views. Moreover, we simplify our model into a 3-layer configuration by setting m = 1, and test its performance. Figure 9 shows the Precision-Recall curves for all 5 categories, and the Average Precision values are reported in Table 3. Our complete model achieves the best mean AP and the best APs for all 5 categories, and the results clearly demonstrate the benefit of using the layered And-Or structures. Several representative detection results are exhibited in Figure 10.

Fig. 10. A few typical object shape detections generated by our approach on the SYSU-Shape dataset. The localized contours are highlighted in black, and the green boxes and red boxes indicate detected shapes and their parts, respectively.

Experiment II. The UIUC-People dataset contains 593 images (346 for training, 247 for testing) that are very challenging due to large shape variations caused by different views and human poses. Most of the images contain people playing badminton. The existing methods 123,11 that are tested on this dataset usually rely on rich appearance-based image features and/or manually labeled prior models. To the best of our knowledge, this work is the first shape-based detector to achieve comparable performance on this dataset. Figure 11(a) shows the trained And-Or model (AOG), which includes 2 and-nodes and 8 or-nodes, and each or-node is associated with 2 to 4 leaf-nodes. Since most of the images contain one person, we only consider the detection with the highest score on an image for all of the methods. Table 4 reports the quantitative detection accuracies generated by our method and the competing approaches [37], m, m, IZj. The results (except ours) come from [37]. A number of representative detection results are presented in Figure 12, where the localized contours are highlighted in black, and the green boxes and red boxes indicate detected humans and parts, respectively. We also present several inaccurate detections indicated by the blue boxes in Figure 12. There are two main reasons for the failure cases: (i) false positives are sometimes created by background contour segments that closely resemble the objects of interest; (ii) the object contours are insufficiently discriminative for recognition, particularly with unconventional object poses and views.

Fig. 11. The trained And-Or graph model with the UIUC-People dataset. (a) Visualizes the model of 4 layers (legend: root-node, and-node, or-node, leaf-node). (b) Exhibits leaf-nodes associated with or-nodes, $U_1, \ldots, U_8$. A real detection case with the activated leaf-nodes is highlighted in red.

Method                 Accuracy
AOG model              0.708
Wang et al.            0.668
Andriluka et al. HI    0.506
Felz et al. (3         0.486
Bourdev et al. (U      0.458

TABLE 4
Comparisons of detection accuracies on the UIUC-People dataset.

Fig. 12. A few typical object shape detections generated by our method on the UIUC-People database. The localized contours are highlighted in black, and the green boxes and red boxes indicate detected people and parts, respectively. Two failure detections are indicated by the blue boxes.

Fig. 13. Results on the INRIA-Horse database. (a) shows several detected shapes by our method, where the localized contours are highlighted in black, and two failure detections are indicated by the blue boxes. (b) shows the quantitative results with the recall-FPPI measurement.

Experiment III. The INRIA Horse dataset comprises 170 horse images and 170 images without horses. The challenges of this dataset arise from background clutter and large deformations, and some of the images contain more than one horse. Following the common experimental setting, we use 50 positive examples and 80 negative examples for training and the remaining 210 images for testing.

Some typical shape detection results on the INRIA Horse dataset are shown in Figure 13(a). For comparison with existing approaches, we use the recall-FPPI (false positives per image) curves for evaluation, as Figure 13(b) reports. It is shown that our approach (denoted as AOG) substantially outperforms the competing methods. Our model achieves a detection rate of 89.6% at 1.0 FPPI; in contrast, the results of competing methods are: 87.3% in ill, 85.27% in G2I, 80.77% in ||^, and 73.75% in Hi.
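A detection rate "at 1.0 FPPI" can be read off a ranked list of detections as sketched below. The input format (a score and a true/false-positive flag per detection, already matched to the ground truth) and the function name are assumptions; ties between scores are not treated specially.

```python
def recall_at_fppi(detections, num_positives, num_images, fppi=1.0):
    """Highest recall reachable while false positives per image stay <= fppi.

    detections: list of (score, is_true_positive) over the whole test set.
    num_positives: number of ground-truth objects in the test set.
    num_images: number of test images.
    """
    true_pos = false_pos = 0
    best_recall = 0.0
    # Lowering the score threshold admits detections in descending score order.
    for _, is_tp in sorted(detections, key=lambda d: d[0], reverse=True):
        if is_tp:
            true_pos += 1
        else:
            false_pos += 1
        if false_pos / num_images > fppi:
            break  # the false-positive budget is exhausted
        best_recall = true_pos / num_positives
    return best_recall
```

Sweeping `fppi` over a range of values with the same function traces out the recall-FPPI curve used for the comparison.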

Empirical analysis. For further evaluation, we present two empirical analyses under different model settings as follows.

(I) We validate the benefit of the contextual collaborative edges. Our model can be reduced to a tree structure by removing the interactions, which is denoted as "And-Or Tree (AOT)". On the UIUC-People dataset, the detection accuracy of the AOT model is 0.69, which is lower than that of the complete form of our model (0.708) but still comparable to the state of the art. On the INRIA-Horse dataset, we also present the results yielded by the AOT model in Figure 13(b). Based on these results, we can observe that the collaborative edges effectively boost the detection against disturbing surrounding clutter and occlusions.



Fig. 14. Model capabilities during the iterative training. We plot the average precision (AP) against the number of iterations, i.e. the intermediate performances of our models in the iteration steps. We conduct the experiments on the UIUC-People dataset (on the left) and INRIA Horse dataset (on the right). The results of disabling the collaborative edges are also reported.

(II) To analyze the model capacity during the iterative training, we output the intermediate performance measures of our models in the iteration steps. We execute the experiments on the UIUC-People and the INRIA-Horse databases. The quantitative results represented by average precisions (APs) are visualized in Figure 14. We also report the results generated by the models without collaborative edges, i.e. AOT models. We observe that the discriminative capabilities of our model increase steadily with the iterations and converge after a few rounds.

7 Conclusion and Future Work

In this paper, we have introduced, first, a hierarchical and reconfigurable object shape model in the form of an And-Or graph representation; second, an efficient inference algorithm for shape detection with the proposed model; and third, a principled learning method that iteratively determines the model structures while optimizing multi-layer parameters. We demonstrated the practical applicability of our approach by effectively detecting and localizing object shapes from cluttered edge maps. Our model effectively captured large shape variations in deformation for different views and poses. Experiments were conducted on several very challenging databases (e.g. SYSU-Shapes, UIUC-People, and INRIA-Horse), and our model outperformed other current state-of-the-art approaches.

There are several directions in which we intend to extend this work. The first is to complement our contour-based features with rich appearance information, thereby adapting our model to more general object recognition. The second is to generalize our model in the context of multiclass recognition and investigate part-based structure sharing among classes. For example, the feet of horses and sheep have similar appearances and thus can be detected by the same local classifier; that is, we can share local classifiers (i.e. the leaf-nodes in our model) across categories. Model sharing will keep the model compact while representing multiple categories. Moreover, the inference algorithm will be revised accordingly to deal with a large number of candidate compositions.